Prototype for storage API #7

jyoune · 2024-07-10T20:16:09Z

Basic skeleton for a prototype storage api for MMIF files that creates nested subdirectories based on the views and corresponding metadata present.


keighrim

FastAPI is uncharted territory for me as well. Let's keep researching it.

prototype/storage_api.py

keighrim · 2024-07-12T15:12:32Z

For reference, this PR addresses the "storage" side of clamsproject/aapb-evaluations#50.


keighrim

Functionally, this looks good. I left some additional feature requests and suggestions in the comments.

keighrim · 2024-08-06T14:09:35Z

prototype/storage_api.py

+from typing import List, Dict
+from typing_extensions import Annotated
+import os
+import yaml


This yaml import is used for only one purpose: loading the config file. However, the existing code is configured through a .env file and the dotenv module, so let's migrate to environment-variable-based configuration to match it.
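A minimal sketch of what environment-based configuration could look like. The variable name `MMIF_STORAGE_DIR` and the helper `get_storage_root` are hypothetical, chosen only for illustration; the real key is whatever the project's .env file ends up defining. The fallback for a missing python-dotenv package is there only to keep the sketch runnable on its own.

```python
import os
from pathlib import Path

try:
    # python-dotenv, the module the existing code is said to rely on
    from dotenv import load_dotenv
except ImportError:
    # Fallback so the sketch runs without the package installed.
    def load_dotenv(*args, **kwargs):
        return False

# Hypothetical variable name, not taken from the PR.
STORAGE_DIR_VAR = "MMIF_STORAGE_DIR"


def get_storage_root(environ=None) -> Path:
    """Return the storage root directory named by the environment."""
    if environ is None:
        # Copy entries from a local .env file into os.environ, if one exists.
        load_dotenv()
        environ = os.environ
    try:
        return Path(environ[STORAGE_DIR_VAR]).expanduser()
    except KeyError:
        raise RuntimeError(
            f"{STORAGE_DIR_VAR} is not set; add it to .env or export it"
        ) from None
```

Passing a plain dict as `environ` makes the function easy to test without touching the real environment.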

keighrim · 2024-08-06T14:16:16Z

prototype/storage_api.py

+    return {"message": "Storage api for pipelined mmif files"}
+
+
+@app.route("/upload_mmif/", methods=["POST"])


Let's think about the routes here. The existing "baapb resolver" codebase (api/__init__.py) defines many routes, but only one of them (/searchapi/) is a pure API endpoint; the others are attached to web pages with a minimal front-end implementation.

Eventually, we want to run a single server app that can handle both "assets" (video files) and MMIF, meaning this storage_api.py should be merged into the api/ package. To that end, I think we need some basic rules for naming these routes.

Since /searchapi is already used on the production server (for baapb:// resolution), how about

/upload_mmif >> /storeapi/mmif

/retrieve >> /searchapi/mmif
?

Note that changes to the route names will also affect the client-side implementation (that you're currently working on).
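For illustration, the proposed names could be registered on a Flask blueprint roughly as follows. This is a sketch under assumptions: the blueprint name, the handler names, the placeholder bodies, and the use of POST for the search route are all hypothetical; only the two URL paths come from the proposal above.

```python
from flask import Blueprint, Flask, jsonify, request

# Hypothetical blueprint name; only the two URL paths come from the proposal.
mmif_api = Blueprint("mmif_api", __name__)


@mmif_api.route("/storeapi/mmif", methods=["POST"])
def store_mmif():
    # Placeholder body: a real handler would write the uploaded MMIF to storage.
    payload = request.get_json(silent=True) or {}
    return jsonify({"stored": bool(payload)}), 201


@mmif_api.route("/searchapi/mmif", methods=["POST"])
def search_mmif():
    # Placeholder body: a real handler would look up stored MMIF for the query.
    query = request.get_json(silent=True) or {}
    return jsonify({"guid": query.get("guid")})


def create_app() -> Flask:
    """Build an app that serves the blueprint; other blueprints could join it."""
    app = Flask(__name__)
    app.register_blueprint(mmif_api)
    return app
```

Keeping the routes on a blueprint rather than on the app object is one way to let them live in the same server as the existing resolver routes.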

keighrim · 2024-08-06T14:20:56Z

prototype/storage_api.py

+    # and dump the param dicts
+    os.makedirs(directory, exist_ok=True)
+    for path in param_path_dict:
+        file_path = os.path.join(path, 'parameters.json')


I tried this out and found that it would be more convenient for clients if we store only .mmif files in this directory and add an API that returns the absolute path of the directory (just like how baapb:// is resolved). That way, when code that uses pre-stored MMIF files runs on the same machine where those files live, the client doesn't have to "download" them; it can access the directory directly (if it exists) and use the files in it.

And if we store the parameter files in the "appversion" directory and keep only MMIF files in the "hash" directory, clients can just use a * glob to read all the MMIF files without worrying about any additional metadata-like files.
So instead of

storage-root/
├── app1
│   ├── v1
│   │   ├── hash1
│   │   │   ├── guid1.mmif
│   │   │   └── params.json
│   │   ├── hash2
│   │   │   ├── guid1.mmif
│   │   │   └── params.json
│   │   └── hash3
│   │       ├── guid1.mmif
│   │       └── params.json
│   └── v2
│       └── hash1
│           ├── guid1.mmif
│           └── params.json
└── app2
    ├── v1
    │   └── hash1
    │       ├── guid1.mmif
    │       └── params.json
    └── v2
        └── hash1
            ├── guid1.mmif
            └── params.json

we would have

storage-root/
├── app1
│   ├── v1
│   │   ├── hash1
│   │   │   └── guid1.mmif
│   │   ├── hash1.json
│   │   ├── hash2
│   │   │   └── guid1.mmif
│   │   ├── hash2.json
│   │   ├── hash3
│   │   │   └── guid1.mmif
│   │   └── hash3.json
│   └── v2
│       ├── hash1
│       │   └── guid1.mmif
│       └── hash1.json
└── app2
    ├── v1
    │   ├── hash1
    │   │   └── guid1.mmif
    │   └── hash1.json
    └── v2
        ├── hash1
        │   └── guid1.mmif
        └── hash1.json
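The second layout could be produced and consumed roughly like this. It is a sketch, not the PR's implementation: `store_mmif` and `list_mmifs` are hypothetical helper names, and the parameter hash is assumed to be computed elsewhere.

```python
import json
from pathlib import Path


def store_mmif(root: Path, app: str, version: str, param_hash: str,
               params: dict, guid: str, mmif_text: str) -> Path:
    """Write one MMIF file under root/app/version/hash and return its path."""
    version_dir = root / app / version
    hash_dir = version_dir / param_hash
    hash_dir.mkdir(parents=True, exist_ok=True)
    # Parameters are stored next to the hash directory, not inside it.
    (version_dir / f"{param_hash}.json").write_text(json.dumps(params, indent=2))
    mmif_path = hash_dir / f"{guid}.mmif"
    mmif_path.write_text(mmif_text)
    return mmif_path


def list_mmifs(hash_dir: Path) -> list:
    """Everything in a hash directory is MMIF, so a plain * glob is enough."""
    return sorted(p.name for p in hash_dir.glob("*"))
```

Because the hash directory holds nothing but MMIF files, a client reading it directly needs no filtering logic.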

keighrim · 2024-08-06T14:46:04Z

prototype/storage_api.py

+    pipeline = pipeline_from_param_json(data)
+    # get number of views for rewind if necessary
+    num_views = len(data['pipeline'])
+    guid = data.get('guid')


I think you mentioned you want to work on (or are already working on?) the multi-guid query scenario. As mentioned in another comment, how about also adding a zero-guid query that returns just the full directory path?
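One way the zero-guid and multi-guid cases could share a single code path is sketched below. The function name `resolve_query` and the return shapes are assumptions made for illustration; the PR does not specify the response format.

```python
from pathlib import Path
from typing import Dict, List, Optional, Union


def resolve_query(pipeline_dir: Path, guids: Optional[List[str]] = None
                  ) -> Union[str, Dict[str, Optional[str]]]:
    """Resolve a retrieval query against one pipeline directory.

    No GUIDs   -> absolute path of the directory itself (the zero-guid case).
    With GUIDs -> mapping of each GUID to its MMIF path, or None if missing.
    """
    pipeline_dir = pipeline_dir.resolve()
    if not guids:
        return str(pipeline_dir)
    found: Dict[str, Optional[str]] = {}
    for guid in guids:
        candidate = pipeline_dir / f"{guid}.mmif"
        found[guid] = str(candidate) if candidate.is_file() else None
    return found
```

Returning None for a missing GUID, rather than raising, lets a multi-guid request succeed partially; whether that is the desired behavior is a design decision left open here.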


keighrim · 2024-08-14T15:01:45Z

A few additional suggestions after using it for uploads in recent days.

- zero-guid scenario and "rewind" feature: because of the rewind feature, I planned a regular "garbage collection" process to clean MMIF files out of non-terminal directories. However, this needs more thought, since once garbage collection is in place, a zero-guid query can return an empty directory.
- overwrite: what should happen when the payload of an upload request conflicts with an existing file?
  1. Can we just blindly overwrite?
  2. Can we just blindly reject the upload?
  3. Should we conduct some sort of "deep-diff" between the two MMIF files and decide?
- document locations: we need to decide whether to allow users (uploaders) to use the file:// scheme for document locations (which is not persistent and possibly only available on the user's personal device), or to allow only baapb:// locations for consistency and reproducibility.
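The three overwrite options raised above could be expressed as a policy switch, sketched here. All names (`OnConflict`, `UploadConflict`, `write_mmif`) are hypothetical, and the "compare" branch is only a stand-in for a real deep diff: it checks equality of the parsed JSON, which ignores whitespace and key order but nothing more.

```python
import json
from enum import Enum
from pathlib import Path


class OnConflict(Enum):
    OVERWRITE = "overwrite"  # option 1: replace the stored file
    REJECT = "reject"        # option 2: refuse the upload
    COMPARE = "compare"      # option 3: accept only if the content is equivalent


class UploadConflict(Exception):
    """Raised when an upload is refused because of an existing file."""


def write_mmif(target: Path, mmif_text: str, policy: OnConflict) -> bool:
    """Write mmif_text to target under the given conflict policy.

    Returns True if the file was written, False if an equivalent file was
    already stored and nothing needed to change.
    """
    if target.exists():
        if policy is OnConflict.REJECT:
            raise UploadConflict(f"{target.name} already exists")
        if policy is OnConflict.COMPARE:
            # Stand-in for a deep diff: compare the parsed JSON documents.
            if json.loads(target.read_text()) == json.loads(mmif_text):
                return False
            raise UploadConflict(f"{target.name} differs from the stored file")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(mmif_text)
    return True
```

A real comparison would probably need to ignore fields that legitimately vary between runs, such as timestamps, which is part of what makes option 3 the most expensive to get right.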


Added subdirectory "prototype" containing protype/basic idea for mmif storage api

3a251dc

keighrim requested changes Jul 11, 2024


jyoune added 7 commits July 16, 2024 16:57

Added config.yml, which makes it easier to specify/ change the local storage path

169747d

Made several requested changes including involving the runtime parameters (hashed) in the nested directory structure.

cd77175

Fixed missing line to open file when dumping json.

c959269

Added example "test.json" file for testing with retrieval api.

17ea137

Updated storage_api.py to include retrieval method which a json-formatted representation of the pipeline you're searching for

9ed534c

Added a json to test the rewind function under retrieval.

a1730fb

Added rewind functionality (occurs if submitted pipeline does not match a full filepath within the database) and cleaned up other parts of the code.

38f9195

keighrim requested changes Aug 6, 2024


jyoune added 5 commits August 7, 2024 15:43

Added file to test retrieving multiple mmifs at once

2cebb4e

Added functionality to retrieve multiple mmifs with one request

ca2cafc

Added json file to test for zero-guid query functionality

214cf04

Addressed various issues including going from .yml to .env for configuration, changed route names, added "zero-guid" functionality to return just the absolute path for the pipeline, and changed directory level for storing parameter hashes (from hash dir to version number dir)

18b5e30

Some cleanup

b895d9f

jyoune added 3 commits August 19, 2024 13:48

Some changes to env sample and requirements.txt in relation to storage_api

c58021e

Made changes to include the storage_api in the same app as the baapb resolver

86ec859

Moved storage_api to api folder and made changes (e.g from app to blueprint) such that it can run in the same app as the baapb resolver.

7b3a386
